Computer-based processing of text documents is performed for a variety of purposes. Two example categories of document processing applications are natural language processing and document repurposing. Natural language processing generally involves making the information contained in a text document accessible, for example, by indexing the information, generating speech from the text, or translating the text from one language to another. Specific tasks associated with natural language processing applications include part-of-speech tagging, grammatical parsing, keyword extraction and information retrieval, and text summarization.
Document repurposing involves reusing the information of a document in a different context. For example, in one form of document repurposing, a document is converted from one format, such as a word-processor-based format, to another format, such as HTML. Another example of document repurposing involves converting a paper-based document to a text-based electronic form using a scanner and optical character recognition (OCR) technology.
The presence of extraneous text, for example, headers and footers, in documents may cause problems in natural language processing and document repurposing applications. Headers and footers found on the pages of a document often include information such as page numbers, dates, revision numbers, and chapter headings, as well as decorative information. In some document text formats, the text of the headers and footers is embedded in the text of the body of the document and has no associated identification tags. In natural language processing applications, this extraneous text changes the context of neighboring words in part-of-speech tagging, introduces grammar errors, and distorts the statistics accumulated in keyword extraction and information retrieval. In document repurposing applications, headers and footers interrupt the flow of the body text.
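The effect on keyword statistics can be illustrated with a small sketch. The example below is purely illustrative and is not drawn from any particular system: the three-page document, the header text "Acme Quarterly Report", the "Page N" footer, and the helper name term_counts are all hypothetical. It counts word frequencies once over page text with the header and footer embedded and untagged, and once over the body text alone.

```python
import re
from collections import Counter


def term_counts(pages):
    """Count lowercase alphabetic word tokens across a list of page strings."""
    counts = Counter()
    for page in pages:
        counts.update(re.findall(r"[a-z]+", page.lower()))
    return counts


# Hypothetical three-page document. Each page carries the same header line
# and a numbered footer line, embedded in the page text with no tags.
HEADER = "Acme Quarterly Report"
bodies = [
    "Sales of widgets rose in the spring.",
    "Widgets sold well in the summer too.",
    "Demand for widgets fell in the autumn.",
]
pages_with_extras = [
    f"{HEADER}\n{body}\nPage {number}"
    for number, body in enumerate(bodies, start=1)
]

with_extras = term_counts(pages_with_extras)
body_only = term_counts(bodies)

# Header and footer words appear once per page, so they rival the
# frequency of the genuine topic word "widgets".
for word in ("widgets", "report", "page"):
    print(word, with_extras[word], body_only[word])
```

In this sketch the header word "report" and the footer word "page" occur as often as "widgets" when the extraneous text is left in, although neither occurs in the body at all. A frequency-based keyword extractor would therefore rank them alongside the real subject of the document.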
A system and a method that address the aforementioned problems, as well as other related problems, are therefore desirable.